Statistical Science 

2012, Vol. 27, No. 1, 82-94 

DOI: 10.1214/11-STS383 

© Institute of Mathematical Statistics, 2012 



From Minimax Shrinkage Estimation 
to Minimax Shrinkage Prediction 



Edward I. George, Feng Liang and Xinyi Xu 



Abstract. In a remarkable series of papers beginning in 1956, Charles 
Stein set the stage for the future development of minimax shrinkage 
estimators of a multivariate normal mean under quadratic loss. More 
recently, parallel developments have seen the emergence of minimax 
shrinkage estimators of multivariate normal predictive densities under 
Kullback-Leibler risk. We here describe these parallels emphasizing the 
focus on Bayes procedures and the derivation of the superharmonic con- 
ditions for minimaxity as well as further developments of new minimax 
shrinkage predictive density estimators including multiple shrinkage es- 
timators, empirical Bayes estimators, normal linear model regression 
estimators and nonparametric regression estimators. 

Key words and phrases: Asymptotic minimaxity, Bayesian prediction, 
empirical Bayes, inadmissibility, multiple shrinkage, prior distributions, 
superharmonic marginals, unbiased estimates of risk. 



1. THE BEGINNING OF THE HUNT FOR 
MINIMAX SHRINKAGE ESTIMATORS 

Perhaps the most basic estimation problem in Statistics is the canonical problem of estimating a multivariate normal mean. Based on the observation of a $p$-dimensional multivariate normal random variable

(1)    $X \mid \mu \sim N_p(\mu, I)$,

the problem is to find a suitable estimator $\hat\mu(x)$ of $\mu$. The celebrated result of Stein (1956) dethroned $\hat\mu_{\mathrm{MLE}}(x) = x$, the maximum likelihood and best location invariant estimator for this problem, by showing that, when $p \ge 3$, $\hat\mu_{\mathrm{MLE}}$ is inadmissible under quadratic loss

(2)    $R_Q(\mu, \hat\mu) = E_\mu \|\hat\mu(X) - \mu\|^2$.

From a decision theory point of view, an important part of the appeal of $\hat\mu_{\mathrm{MLE}}$ was the protection offered by its minimax property. The worst possible risk $R_Q$ incurred by $\hat\mu_{\mathrm{MLE}}$ was no worse than the worst possible risk of any other estimator. Stein's result implied the existence of even better estimators that offered the same minimax protection. He had begun the hunt for these better minimax estimators.

In a remarkable series of follow-up papers Stein proceeded to set the stage for this hunt. James and Stein (1961) proposed a new closed-form minimax shrinkage estimator

(3)    $\hat\mu_{\mathrm{JS}}(x) = \left(1 - \frac{p-2}{\|x\|^2}\right) x$,

the now well-known James-Stein estimator, and showed explicitly that its risk was less than $R_Q(\mu, \hat\mu_{\mathrm{MLE}}) \equiv p$ for every value of $\mu$ when $p \ge 3$, that is, it uniformly dominated $\hat\mu_{\mathrm{MLE}}$. The appeal of $\hat\mu_{\mathrm{JS}}$ under $R_Q$ was compelling. It offered the same guaranteed minimax protection as $\hat\mu_{\mathrm{MLE}}$ while also offering the possibility of doing much better.
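The domination of the maximum likelihood estimator by the James-Stein estimator can be illustrated by simulation. The following is a minimal sketch, not part of the original development: the function names, the dimension $p = 10$, the grid of means and the number of Monte Carlo draws are all illustrative assumptions.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimate for each row of x, assuming X | mu ~ N_p(mu, I)."""
    p = x.shape[1]
    norm_sq = np.sum(x**2, axis=1, keepdims=True)
    return (1.0 - (p - 2) / norm_sq) * x

def quadratic_risk(estimator, mu, n_draws=20000, seed=0):
    """Monte Carlo estimate of R_Q(mu, estimator) = E ||estimator(X) - mu||^2."""
    rng = np.random.default_rng(seed)
    x = mu + rng.standard_normal((n_draws, mu.size))
    return np.mean(np.sum((estimator(x) - mu) ** 2, axis=1))

p = 10  # illustrative dimension (any p >= 3 would do)
for c in (0.0, 1.0, 3.0):
    mu = np.full(p, c)
    risk_mle = quadratic_risk(lambda x: x, mu)
    risk_js = quadratic_risk(james_stein, mu)
    print(f"c = {c}: MLE risk ~ {risk_mle:.2f}, James-Stein risk ~ {risk_js:.2f}")
```

The simulated risk of the maximum likelihood estimator should stay near $p$ at every mean, while the James-Stein risk should be smaller, most markedly near the origin.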

Stein (1962), though primarily concerned with improved confidence regions, offered a parametric empirical Bayes motivation for (3), describing how $\hat\mu_{\mathrm{JS}}(x)$ could be seen as a data-based approximation to the posterior mean

(4)    $E_\pi(\mu \mid x) = \left(1 - \frac{1}{1+\nu}\right) x$,

the Bayes rule which minimizes the average risk $E_\pi R_Q(\mu, \hat\mu)$ when $\mu \sim N_p(0, \nu I)$. He here also proposed the positive-part James-Stein estimator $\hat\mu_{\mathrm{JS}+}(x) = (1 - (p-2)/\|x\|^2)_+\, x$, where $(\cdot)_+ = \max\{0, \cdot\}$, a dominating improvement over $\hat\mu_{\mathrm{JS}}(x)$, and commented that "it would be even better to use the Bayes estimate with respect to a reasonable prior distribution." These observations served as a clear indication that the Bayesian paradigm was to play a major role in the hunt for these new shrinkage estimators, opening up a new direction that was to be ultimately successful for establishing large new classes of shrinkage estimators.

Dominating fully Bayes shrinkage estimators soon emerged. Strawderman (1971) proposed $\hat\mu_a(x) = E_{\pi_a}(\mu \mid x)$, a class of Bayes shrinkage estimators obtained as posterior means under priors $\pi_a(\mu)$ for which

(5)    $\mu \mid s \sim N_p(0, sI), \qquad s \sim (1+s)^{a-2}$.

Strawderman explicitly showed that $\hat\mu_a$ uniformly dominated $\hat\mu_{\mathrm{MLE}}$ and was proper Bayes, when $p = 5$ and $a \in [0.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. This was especially interesting because any proper Bayes estimator was necessarily admissible and so could not be improved upon.

Then, Stein (1974, 1981) showed that $\hat\mu_H(x) = E_{\pi_H}(\mu \mid x)$, the Bayes estimator under the harmonic prior

(6)    $\pi_H(\mu) = \|\mu\|^{-(p-2)}$,

dominated $\hat\mu_{\mathrm{MLE}}$ when $p \ge 3$. A special case of $\hat\mu_a$ when $a = 2$, $\hat\mu_H$ was only formal Bayes because $\pi_H(\mu)$ is improper. Undeterred, Stein pointed out that the admissibility of $\hat\mu_H$ followed immediately from the general conditions for the admissibility of generalized Bayes estimators laid out by Brown (1971). A further key element of the story was Brown's (1971) powerful result that all such generalized Bayes rules (including the proper ones of course) constituted a complete class for the problem of estimating a multivariate normal mean under quadratic loss. It was now clear that the hunt for new minimax shrinkage estimators was to focus on procedures with at least some Bayesian motivation.



Perhaps even more impressive than the fact that $\hat\mu_H$ dominated $\hat\mu_{\mathrm{MLE}}$ was the way Stein proved it. Making further use of the rich results in Brown (1971), the key to his proof was the fact that any posterior mean Bayes estimator under a prior $\pi(\mu)$ can be expressed as

(7)    $\hat\mu_\pi(x) = E_\pi(\mu \mid x) = x + \nabla \log m_\pi(x)$,

where

(8)    $m_\pi(x) \propto \int e^{-\|x-\mu\|^2/2}\, \pi(\mu)\, d\mu$

is the marginal distribution of $X$ under $\pi(\mu)$. [Here $\nabla = (\partial/\partial x_1, \ldots, \partial/\partial x_p)'$ is the familiar gradient.]
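As a quick numerical check of the representation of a posterior mean as $x + \nabla \log m_\pi(x)$, consider the normal prior $\mu \sim N_p(0, \nu I)$, for which the marginal of $X$ is $N_p(0, (1+\nu)I)$ and the posterior mean is $(1 - 1/(1+\nu))x$. The sketch below is illustrative only: the helper names, the finite-difference gradient and the particular values of $x$ and $\nu$ are assumptions made for the example.

```python
import numpy as np

def log_marginal_normal_prior(x, nu):
    """log m_pi(x) for mu ~ N_p(0, nu I) and X | mu ~ N_p(mu, I): X ~ N_p(0, (1 + nu) I)."""
    p = x.size
    s = 1.0 + nu
    return -0.5 * p * np.log(2 * np.pi * s) - 0.5 * np.dot(x, x) / s

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

def bayes_estimate_via_brown(x, nu):
    """Posterior mean computed as x + grad log m_pi(x)."""
    return x + numerical_gradient(lambda z: log_marginal_normal_prior(z, nu), x)

x = np.array([1.0, -2.0, 0.5, 3.0])   # illustrative observation
nu = 2.0                              # illustrative prior variance
via_brown = bayes_estimate_via_brown(x, nu)
closed_form = (1.0 - 1.0 / (1.0 + nu)) * x
print(np.max(np.abs(via_brown - closed_form)))   # should be numerically negligible
```

The two computations should agree up to finite-difference error, which is what the representation predicts for this conjugate case.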

At first glance it would appear that (7) has little to do with the risk. However, Stein noted that insertion of (7) into $R_Q$, followed by expansion and an integration-by-parts identity, now known as one of Stein's Lemmas, yields the following general expression for the difference between the risks of $\hat\mu_\pi$ and $\hat\mu_{\mathrm{MLE}}$:

$R_Q(\mu, \hat\mu_{\mathrm{MLE}}) - R_Q(\mu, \hat\mu_\pi)$

(9)    $= E_\mu\left[ \|\nabla \log m_\pi(X)\|^2 - 2\, \frac{\nabla^2 m_\pi(X)}{m_\pi(X)} \right]$

(10)    $= E_\mu\left[ -4\, \nabla^2 \sqrt{m_\pi(X)} \,\big/ \sqrt{m_\pi(X)} \right]$.

(Here $\nabla^2 = \sum_i \partial^2/\partial x_i^2$ is the familiar Laplacian.)

Because the bracketed terms in (9) and (10) do not depend on $\mu$ (they are unbiased estimators of the risk difference), the domination of $\hat\mu_{\mathrm{MLE}}$ by $\hat\mu_\pi$ would follow whenever $m_\pi$ was such that these bracketed terms were nonnegative. As Stein noted, this would be the case in (9) whenever $m_\pi$ was superharmonic, $\nabla^2 m_\pi(x) \le 0$, and in (10) whenever $\sqrt{m_\pi}$ was superharmonic, $\nabla^2 \sqrt{m_\pi(x)} \le 0$, a weaker condition.

The domination of $\hat\mu_{\mathrm{MLE}}$ by $\hat\mu_H$ was seen now to be attributable directly to the fact that the marginal (8) under $\pi_H$, a mixture of harmonic functions, is superharmonic when $p \ge 3$. However, such an explanation would not work for the domination of $\hat\mu_{\mathrm{MLE}}$ by $\hat\mu_a$, because the marginal (8) under $\pi_a$ in (5) is not superharmonic for any $a < 1$. Indeed, as was shown later by Fourdrinier, Strawderman and Wells (1998), a superharmonic marginal cannot be obtained with any proper prior. More importantly, however, they were able to establish that the domination by $\hat\mu_a$ was attributable to the superharmonicity of $\sqrt{m_{\pi_a}}$ under $\pi_a$ when $p \ge 5$ (and Strawderman's conditions on $a$). In fact, it also followed from their results that $\sqrt{m_{\pi_a}}$ is superharmonic when $a \in [1, 2)$ and $p \ge 3$, further broadening the class of minimax improper Bayes estimators.
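The harmonic and superharmonic conditions can be explored numerically with a finite-difference Laplacian. The sketch below is an illustration under stated assumptions: it checks that $\|\mu\|^{-(p-2)}$ has (numerically) zero Laplacian away from the origin, and, as a hypothetical comparison not taken from the text, that $\|\mu\|^{-2}$ has negative Laplacian when $p = 5$. The test point, step size and function names are arbitrary choices.

```python
import numpy as np

def harmonic_prior(mu):
    """Harmonic prior ||mu||^{-(p-2)}, with p the length of mu."""
    p = mu.size
    return np.linalg.norm(mu) ** (-(p - 2))

def numerical_laplacian(f, x, h=1e-3):
    """Sum of central second differences of f along each coordinate."""
    total = 0.0
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        total += (f(x + step) - 2.0 * f(x) + f(x - step)) / h**2
    return total

mu = np.array([1.0, 0.5, -0.7, 1.2, 0.3])   # p = 5, chosen away from the origin
lap_harmonic = numerical_laplacian(harmonic_prior, mu)
# Hypothetical comparison function: ||mu||^{-2}, superharmonic in dimension 5.
lap_inverse_square = numerical_laplacian(lambda m: np.linalg.norm(m) ** (-2.0), mu)
print(lap_harmonic)         # close to 0
print(lap_inverse_square)   # negative
```

A check of this kind only probes a single point and says nothing about behavior at the origin, so it is a sanity check rather than a proof of superharmonicity.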

Prior to the appearance of (9) and (10), minimax- 
ity proofs, though ingenious, had all been tailored 
to suit the specific estimators at hand. The sheer 
generality of this new approach was daunting in its 
scope. By restricting attention to priors that gave 
rise to marginal distributions with particular prop- 
erties, the minimax properties of the implied Bayes 
rules would be guaranteed. 

2. THE PARALLELS IN THE PREDICTIVE 
ESTIMATION PROBLEM EMERGE 

The seminal work of Stein concerned the canonical problem of how to estimate $\mu$ based on an observation of $X \mid \mu \sim N_p(\mu, I)$. A more ambitious problem is how to use such an $X$ to estimate the entire probability distribution of a future $Y$ from a normal distribution with this same unknown mean $\mu$, the so-called predictive density of $Y$. Such a predictive density offers a complete description of predictive uncertainty.

To conveniently treat the possibility of different variances for $X$ and $Y$, we formulate the predictive problem as follows. Suppose $X \mid \mu \sim N_p(\mu, v_x I)$ and $Y \mid \mu \sim N_p(\mu, v_y I)$ are independent $p$-dimensional multivariate normal vectors with common unknown mean $\mu$ but known variances $v_x$ and $v_y$. Letting $p(y \mid \mu)$ denote the density of $Y$, the problem is to find an estimator $\hat p(y \mid x)$ of $p(y \mid \mu)$ based on the observation of $X = x$ only. Such a problem arises naturally, for example, for predicting $Y \mid \mu \sim N_p(\mu, \sigma^2 I)$ based on the observation of $X_1, \ldots, X_n \mid \mu$ i.i.d. $\sim N_p(\mu, \sigma^2 I)$, which is equivalent to observing $\bar X \mid \mu \sim N_p(\mu, (\sigma^2/n) I)$. This is exactly our formulation with $v_x = \sigma^2/n$ and $v_y = \sigma^2$.

For the evaluation of $\hat p(y \mid x)$ as an estimator of $p(y \mid \mu)$, the analogue of quadratic risk $R_Q$ for the mean estimation problem is the Kullback-Leibler (KL) risk

(11)    $R_{\mathrm{KL}}(\mu, \hat p) = \int p(x \mid \mu)\, L(\mu, \hat p(\cdot \mid x))\, dx$,

where $p(x \mid \mu)$ denotes the density of $X$, and

(12)    $L(\mu, \hat p(\cdot \mid x)) = \int p(y \mid \mu) \log \frac{p(y \mid \mu)}{\hat p(y \mid x)}\, dy$

is the familiar KL loss.

For a (possibly improper) prior distribution $\pi$ on $\mu$, the average risk $r(\pi, \hat p) = \int R_{\mathrm{KL}}(\mu, \hat p)\, \pi(\mu)\, d\mu$ is minimized by the Bayes rule

(13)    $\hat p_\pi(y \mid x) = E_\pi[p(y \mid \mu) \mid x] = \int p(y \mid \mu)\, \pi(\mu \mid x)\, d\mu$,

the posterior mean of $p(y \mid \mu)$ under $\pi$ (Aitchison, 1975). It follows from (13) that $\hat p_\pi(y \mid x)$ is a proper probability distribution over $y$ whenever the marginal density of $x$ is finite for all $x$ (integrate w.r.t. $y$ and switch the order of integration). Furthermore, the mean of $\hat p_\pi(y \mid x)$ (when it exists) is equal to $E_\pi(\mu \mid x)$, the Bayes rule for estimating $\mu$ under quadratic loss, namely the posterior mean of $\mu$. Thus, $\hat p_\pi$ also carries the necessary information for that estimation problem. Note also that unless $\pi$ is a trivial point prior, such $\hat p_\pi(y \mid x)$ will not be of the form of $p(y \mid \mu)$ for any $\mu$. The range of the Bayes rules here falls outside the target space of the densities which are being estimated.

A tempting initial approach to this predictive density estimation problem is to use the simple plug-in estimator $\hat p_{\mathrm{MLE}} = p(y \mid \mu = \hat\mu_{\mathrm{MLE}})$ to estimate $p(y \mid \mu)$, the so-called estimative approach. This was the conventional wisdom until the appearance of Aitchison (1975). He showed that the plug-in estimator $\hat p_{\mathrm{MLE}}$ is uniformly dominated under $R_{\mathrm{KL}}$ by

(14)    $\hat p_U(y \mid x) = E_{\pi_U}[p(y \mid \mu) \mid x] = \frac{1}{\{2\pi(v_x + v_y)\}^{p/2}} \exp\left\{ -\frac{\|y - x\|^2}{2(v_x + v_y)} \right\}$,

the posterior mean of $p(y \mid \mu)$ with respect to the uniform prior $\pi_U(\mu) = 1$, the so-called predictive approach. In a related vein, Akaike (1978) pointed out that, by Jensen's inequality, the Bayes rule $\hat p_\pi(y \mid x)$ would dominate the random plug-in estimator $p(y \mid \mu = \tilde\mu)$ when $\tilde\mu$ is a random draw from $\pi$. Strategies for averaging over $\mu$ were looking better than plug-in strategies. The hunt for predictive shrinkage estimators had turned to Bayes procedures.
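Aitchison's domination result can be made concrete, since the KL loss between two spherical normal densities is available in closed form. A short calculation (an addition here, not taken from the text) gives KL risk $p v_x/(2 v_y)$ for the plug-in estimator and $(p/2)\log(1 + v_x/v_y)$ for the uniform-prior predictive density, both constant in $\mu$. The sketch below checks these values by simulation; the function names and settings are illustrative assumptions.

```python
import numpy as np

def kl_normal(mu, v_y, center, s):
    """KL loss between the true N_p(mu, v_y I) and an estimate N_p(center, s I)."""
    p = mu.size
    return 0.5 * p * (np.log(s / v_y) + v_y / s - 1.0) + np.sum((mu - center) ** 2) / (2.0 * s)

def kl_risks(mu, v_x, v_y, n_draws=20000, seed=0):
    """Monte Carlo KL risks of the plug-in density and of the uniform-prior density."""
    rng = np.random.default_rng(seed)
    xs = mu + np.sqrt(v_x) * rng.standard_normal((n_draws, mu.size))
    plug_in = np.mean([kl_normal(mu, v_y, x, v_y) for x in xs])          # N_p(x, v_y I)
    uniform = np.mean([kl_normal(mu, v_y, x, v_x + v_y) for x in xs])    # N_p(x, (v_x + v_y) I)
    return plug_in, uniform

mu = np.zeros(5)            # illustrative mean; both risks are constant in mu
v_x, v_y = 1.0, 0.2
plug_in, uniform = kl_risks(mu, v_x, v_y)
print(plug_in, 0.5 * mu.size * v_x / v_y)               # simulation vs closed form
print(uniform, 0.5 * mu.size * np.log(1 + v_x / v_y))   # simulation vs closed form
```

Since $\log(1 + t) < t$ for $t > 0$, the closed forms already show that the predictive approach has uniformly smaller KL risk than the estimative approach in this setting.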

Distinct from $\hat p_{\mathrm{MLE}}$, $\hat p_U$ was soon shown to be the best location invariant predictive density estimator; see Murray (1977) and Ng (1980). That $\hat p_U$ is best invariant and minimax also follows from the more recent general results of Liang and Barron (2004), who also showed that $\hat p_U$ is admissible when $p = 1$. The minimaxity of $\hat p_U$ was also shown directly by George, Liang and Xu (2006). Thus, $\hat p_U$, rather than $\hat p_{\mathrm{MLE}}$, here plays the role played by $\hat\mu_{\mathrm{MLE}}$ in the mean estimation context. Not surprisingly, $\hat\mu_U = x$, the posterior mean under the uniform prior $\pi_U$, is identical to $\hat\mu_{\mathrm{MLE}}$ in that context.

The parallels between the mean estimation problem and the predictive estimation problem came into sharp focus with the stunning breakthrough result of Komaki (2001). He proved that when $p \ge 3$, $\hat p_U(y \mid x)$ itself is dominated by the Bayes rule

(15)    $\hat p_H(y \mid x) = E_{\pi_H}[p(y \mid \mu) \mid x]$,

under the harmonic prior $\pi_H(\mu)$ in (6) used by Stein (1974). Shortly thereafter Liang (2002) showed that $\hat p_U(y \mid x)$ is dominated by the proper Bayes rule $\hat p_a(y \mid x)$ under $\pi_a(\mu)$ for which

(16)    $\mu \mid s \sim N_p(0, s v_0 I), \qquad s \sim (1+s)^{a-2}$,

when $v_x \le v_0$, and when $p = 5$ and $a \in [0.5, 1)$ or $p \ge 6$ and $a \in [0, 1)$, the same conditions that Strawderman had obtained for his estimator. Note that $\pi_a(\mu)$ in (16) is an extension of (5) which depends on the constant $v_0$. As before, $\pi_H(\mu)$ is the special case of $\pi_a(\mu)$ when $a = 2$. Note that $\hat p_U$ is now playing the "straw-man" role that was played by $\hat\mu_{\mathrm{MLE}}$ in the mean estimation problem.

3. A UNIFIED THEORY FOR MINIMAX 
PREDICTIVE DENSITY ESTIMATION 

The proofs of the domination of $\hat p_U$ by $\hat p_H$ in Komaki (2001) and by $\hat p_a$ in Liang (2002) were both tailored to the specific forms of the dominating estimators. They did not make direct use of the properties of the induced marginal distributions of $X$ and $Y$. From the theory developed by Brown (1971) and Stein (1974) for the mean estimation problem, it was natural to ask if there was a theory analogous to (7)-(10) which would similarly unify the domination results in the predictive density estimation problem.

As it turned out, just such a theory was established in George, Liang and Xu (2006), the main results of which we now proceed to describe. The story begins with a representation, analogous to Brown's representation $\hat\mu_\pi(X) = E_\pi(\mu \mid X) = X + \nabla \log m_\pi(X)$ in (7), that is available for posterior mean Bayes rules in the predictive density estimation problem. A key element of the representation is the form of the marginal distributions for our context which we denote by

(17)    $m_\pi(z; v) = \int p(z \mid \mu)\, \pi(\mu)\, d\mu$,

for $Z \mid \mu \sim N_p(\mu, vI)$ and a prior $\pi(\mu)$. In terms of our previous notation (8), $m_\pi(z) = m_\pi(z; 1)$.

Lemma 1. The Bayes rule $\hat p_\pi(y \mid x)$ in (13) can be expressed as

(18)    $\hat p_\pi(y \mid x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)}\, \hat p_U(y \mid x)$,

where $\hat p_U(y \mid x)$ is the Bayes rule under $\pi_U(\mu) \equiv 1$ given by (14), $m_\pi(x; v_x)$ is the marginal distribution of $X$, and $m_\pi(w; v_w)$, where $v_w = \frac{v_x v_y}{v_x + v_y}$, is the marginal distribution of $W = \frac{v_y X + v_x Y}{v_x + v_y}$ for independent $X \mid \mu \sim N_p(\mu, v_x I)$ and $Y \mid \mu \sim N_p(\mu, v_y I)$.

Lemma 1 shows how the form of $\hat p_\pi(y \mid x)$ is determined entirely by $\hat p_U(y \mid x)$ and the form of $m_\pi(x; v_x)$ and $m_\pi(w; v_w)$. The essential step in its derivation is to factor the joint distribution of $x$ and $y$ into terms including a function of the sufficient statistic $w$. Inserting the representation (18) into the risk $R_{\mathrm{KL}}$ leads immediately to the following unbiased estimate for the KL risk difference between $\hat p_U(y \mid x)$ and $\hat p_\pi(y \mid x)$:

(19)    $R_{\mathrm{KL}}(\mu, \hat p_U) - R_{\mathrm{KL}}(\mu, \hat p_\pi) = \int\!\!\int p(x \mid \mu)\, p(y \mid \mu) \log \frac{\hat p_\pi(y \mid x)}{\hat p_U(y \mid x)}\, dx\, dy = E_{\mu, v_w} \log m_\pi(W; v_w) - E_{\mu, v_x} \log m_\pi(X; v_x)$.

As one can see from (19) and the fact that $v_w = \frac{v_x v_y}{v_x + v_y} < v_x$, $\hat p_U(y \mid x)$ would be uniformly dominated by $\hat p_\pi(y \mid x)$ whenever $E_{\mu, v} \log m_\pi(Z; v)$ is decreasing in $v$. As if by magic, the sign of $\frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v)$ turned out to be directly linked to the same unbiased risk difference estimates (9) and (10) of Stein (1974).

Lemma 2.

(20)    $\frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) = E_{\mu, v}\left[ \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \frac{1}{2} \|\nabla \log m_\pi(Z; v)\|^2 \right]$

(21)    $= E_{\mu, v}\left[ 2\, \nabla^2 \sqrt{m_\pi(Z; v)} \,\big/ \sqrt{m_\pi(Z; v)} \right]$.

The proof of Lemma 2 relies on Brown's representation, Stein's Lemma, and the fact that any normal marginal distribution $m_\pi(z; v)$ satisfies

(22)    $\frac{\partial}{\partial v} m_\pi(z; v) = \frac{1}{2} \nabla^2 m_\pi(z; v)$,

the well-known heat equation which has a long history in science and engineering; for example, see Steele (2001). Combining (19) and Lemma 2 with the fact that $\hat p_U(y \mid x)$ is minimax yields the following general conditions for the minimaxity of a predictive density estimator, conditions analogous to those obtained by Stein for the minimaxity of a normal mean estimator.
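The representation of the Bayes predictive density as the ratio $m_\pi(w; v_w)/m_\pi(x; v_x)$ times $\hat p_U(y \mid x)$, with $w = (v_y x + v_x y)/(v_x + v_y)$ and $v_w = v_x v_y/(v_x + v_y)$, can be checked in a case where everything is available in closed form. The sketch below assumes, purely for illustration, the conjugate prior $\mu \sim N_p(0, \nu I)$, for which $m_\pi(z; v)$ is the $N_p(0, (v + \nu)I)$ density and the predictive density of $Y$ given $x$ is normal with mean $\frac{\nu}{\nu + v_x} x$ and variance $v_y + \frac{\nu}{\nu + v_x} v_x$. All function names and numerical values are assumptions.

```python
import numpy as np

def log_iso_normal(z, center, var):
    """Log density of N_p(center, var I) at z."""
    p = z.size
    return -0.5 * p * np.log(2 * np.pi * var) - 0.5 * np.sum((z - center) ** 2) / var

def log_marginal(z, v, nu):
    """log m_pi(z; v) for the prior mu ~ N_p(0, nu I): Z ~ N_p(0, (v + nu) I)."""
    return log_iso_normal(z, np.zeros_like(z), v + nu)

def log_predictive_via_ratio(y, x, v_x, v_y, nu):
    """log of [m_pi(w; v_w) / m_pi(x; v_x)] * p_U(y | x)."""
    w = (v_y * x + v_x * y) / (v_x + v_y)
    v_w = v_x * v_y / (v_x + v_y)
    log_p_u = log_iso_normal(y, x, v_x + v_y)
    return log_marginal(w, v_w, nu) - log_marginal(x, v_x, nu) + log_p_u

def log_predictive_direct(y, x, v_x, v_y, nu):
    """Conjugate-normal predictive density of Y given X = x, computed directly."""
    shrink = nu / (nu + v_x)
    return log_iso_normal(y, shrink * x, v_y + shrink * v_x)

x = np.array([1.0, -0.5, 2.0])   # illustrative observation
y = np.array([0.3, 0.1, 1.5])    # illustrative future value
a = log_predictive_via_ratio(y, x, v_x=1.0, v_y=0.2, nu=2.0)
b = log_predictive_direct(y, x, v_x=1.0, v_y=0.2, nu=2.0)
print(a, b)   # the two log densities should coincide
```

The normal prior is used only because its marginals are explicit; it is not one of the superharmonic priors discussed in the text.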

Theorem 1. If $m_\pi(z; v)$ is finite for all $z$, then $\hat p_\pi(y \mid x)$ will be minimax if either of the following hold for all $v_w \le v \le v_x$:

(i) $m_\pi(z; v)$ is superharmonic.
(ii) $\sqrt{m_\pi(z; v)}$ is superharmonic.

Although condition (i) implies the weaker condition (ii) above, it is included because of its convenience when it is available. Since a superharmonic prior always yields a superharmonic $m_\pi(z; v)$ for all $v$, the following corollary is immediate.

Corollary 1. If $m_\pi(z; v)$ is finite for all $z$, then $\hat p_\pi(y \mid x)$ will be minimax if $\pi(\mu)$ is superharmonic.

Because $\pi_H$ is superharmonic, it is immediate from Corollary 1 that $\hat p_H$ is minimax. Because $\sqrt{m_a(z; v)}$ is superharmonic for all $v$ (under suitable conditions on $a$), it is immediate from Theorem 1 that $\hat p_a$ is minimax. It similarly follows that any of the improper superharmonic $t$-priors of Faith (1978) or any of the proper generalized $t$-priors of Fourdrinier, Strawderman and Wells (1998) yield minimax Bayes rules.

The connections between the unbiased risk difference estimates for the KL risk and quadratic risk problems ultimately yield the following identity:

(23)    $R_{\mathrm{KL}}(\mu, \hat p_U) - R_{\mathrm{KL}}(\mu, \hat p_\pi) = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) \right]_v dv$,

where the subscript $v$ indicates that the quadratic risks are evaluated for $Z \mid \mu \sim N_p(\mu, vI)$, explaining the parallel minimax conditions in both problems. Brown, George and Xu (2008) used this identity to further draw out connections to establish sufficient conditions for the admissibility of Bayes rules under KL loss, conditions analogous to those of Brown (1971) and Brown and Hwang (1982), and to show that all admissible procedures for the KL risk problems are Bayes rules, a direct parallel of the complete class theorem of Brown (1971) for quadratic risk.
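The link between the two risk differences can be checked numerically in a tractable case. The sketch below assumes the identity in the form $R_{\mathrm{KL}}(\mu, \hat p_U) - R_{\mathrm{KL}}(\mu, \hat p_\pi) = \frac{1}{2}\int_{v_w}^{v_x} v^{-2}\,[R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi)]_v\, dv$ and, for illustration only, the conjugate prior $\mu \sim N_p(0, \nu I)$, under which both sides have explicit expressions. The function names, the grid size and the numerical values are assumptions.

```python
import numpy as np

def quad_risk_diff(v, mu, nu):
    """Quadratic risk of z minus that of the linear Bayes rule, for Z | mu ~ N_p(mu, v I)."""
    p = mu.size
    shrink = nu / (nu + v)   # posterior mean is shrink * z under mu ~ N_p(0, nu I)
    risk_bayes = shrink**2 * p * v + (1.0 - shrink) ** 2 * np.dot(mu, mu)
    return p * v - risk_bayes

def expected_log_marginal(v, mu, nu):
    """E_{mu,v} log m_pi(Z; v), where m_pi(.; v) is the N_p(0, (v + nu) I) density."""
    p = mu.size
    return -0.5 * p * np.log(2 * np.pi * (v + nu)) - (p * v + np.dot(mu, mu)) / (2.0 * (v + nu))

def kl_risk_diff(mu, nu, v_x, v_y):
    """KL risk of p_U minus KL risk of the Bayes rule, via expected log marginals."""
    v_w = v_x * v_y / (v_x + v_y)
    return expected_log_marginal(v_w, mu, nu) - expected_log_marginal(v_x, mu, nu)

def integrated_quad_risk_diff(mu, nu, v_x, v_y, n_grid=20001):
    """(1/2) * integral over [v_w, v_x] of quad_risk_diff(v) / v^2, by the trapezoidal rule."""
    v_w = v_x * v_y / (v_x + v_y)
    grid = np.linspace(v_w, v_x, n_grid)
    vals = np.array([quad_risk_diff(v, mu, nu) / v**2 for v in grid])
    return 0.5 * np.sum(np.diff(grid) * (vals[1:] + vals[:-1]) / 2.0)

mu = np.array([1.0, 0.5, -0.5])   # illustrative mean
lhs = kl_risk_diff(mu, nu=2.0, v_x=1.0, v_y=0.2)
rhs = integrated_quad_risk_diff(mu, nu=2.0, v_x=1.0, v_y=0.2)
print(lhs, rhs)   # the two sides should agree to numerical precision
```

Agreement in this conjugate case is consistent with the identity but is of course not a substitute for the general argument of Brown, George and Xu (2008).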

4. THE NATURE OF SHRINKAGE IN 
PREDICTIVE DENSITY ESTIMATION 

The James-Stein estimator $\hat\mu_{\mathrm{JS}}(x)$ in (3) provided an explicit example of how risk improvements for estimating $\mu$ are obtained by shrinking $X$ toward 0 by the adaptive multiplicative factor $(1 - (p-2)/\|x\|^2)$. Similarly, under unimodal priors, posterior mean Bayes rules $\hat\mu_\pi(x) = E_\pi(\mu \mid x)$ shrink $x$ toward the center of $\pi(\mu)$, the mean of $\pi(\mu)$ when it exists. (Section 6 will describe how multimodal priors yield multiple shrinkage estimators.) As we saw earlier, $x$ here plays the role both of $\hat\mu_{\mathrm{MLE}}(x) = x$ and of the formal Bayes estimator $\hat\mu_U(x) = x$.

The representation (18) reveals how $\hat p_\pi(y \mid x)$ analogously "shrinks" the formal Bayes estimator $\hat p_U(y \mid x)$, but not $\hat p_{\mathrm{MLE}} \ne \hat p_U$, by an adaptive multiplicative factor

(24)    $b_\pi(x, y) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)}$.

However, because $\hat p_\pi(y \mid x)$ must be a proper probability distribution (whenever $m_\pi$ is always finite), it cannot be the case that $b_\pi(x, y) < 1$ for all $y$ at any $x$. Thus, "shrinkage" here really refers to a reconcentration of the probability distribution of $\hat p_U(y \mid x)$. Furthermore, since the mean of $\hat p_\pi(y \mid x)$ is $E_\pi(\mu \mid x)$, this reconcentration, under unimodal priors, is toward the center of $\pi(\mu)$, as in the mean estimation case.

Consider, for example, what happens under $\pi_H$ which is symmetric and unimodal about 0. Figure 1 illustrates how this shrinkage occurs for $\hat p_H$ for various values of $x$ when $p = 5$. Figure 1 plots $\hat p_U(y \mid x)$ and $\hat p_H(y \mid x)$ as functions of $y = (y_1, y_2, 0, 0, 0)'$ when $v_x = 1$ and $v_y = 0.2$. Note first that $\hat p_U(y \mid x)$ is always the same symmetric shape centered at $x$. When $x = (2, 0, 0, 0, 0)'$, shrinkage occurs by pushing the concentration of $\hat p_H(y \mid x) = b_H(x, y)\, \hat p_U(y \mid x)$ toward 0. As $x$ moves further from $(0, 0, 0, 0, 0)'$ to $(3, 0, 0, 0, 0)'$ and $(4, 0, 0, 0, 0)'$ this shrinkage diminishes as $\hat p_H(y \mid x)$ becomes more and more similar to $\hat p_U(y \mid x)$.

Fig. 1. Shrinkage of $\hat p_U(y \mid x)$ to obtain $\hat p_H(y \mid x)$ when $v_x = 1$, $v_y = 0.2$ and $p = 5$. Here $y = (y_1, y_2, 0, 0, 0)'$. Panels: $x = (2, 0, 0, 0, 0)'$, $x = (3, 0, 0, 0, 0)'$, $x = (4, 0, 0, 0, 0)'$.

As in the problem of mean estimation, the shrinkage by $\hat p_H$ manifests itself in risk reduction over $\hat p_U$. To illustrate this, Figure 2 displays the risk difference $[R_{\mathrm{KL}}(\mu, \hat p_U) - R_{\mathrm{KL}}(\mu, \hat p_H)]$ at $\mu = (c, \ldots, c)'$, $0 \le c \le 4$, when $v_x = 1$ and $v_y = 0.2$ for dimensions $p = 3, 5, 7, 9$. Paralleling the risk reduction offered by $\hat\mu_H$ in the mean estimation problem, the largest risk reduction offered by $\hat p_H$ occurs close to $\mu = 0$ and decreases rapidly to 0 as $\|\mu\|$ increases. [$R_{\mathrm{KL}}(\mu, \hat p_U)$ is constant as a function of $\mu$.] At the same time, the risk reduction by $\hat p_H$ is larger for larger $p$ at each fixed $\|\mu\|$.

5. MANY POSSIBLE SHRINKAGE TARGETS

By a simple shift of coordinates, the modified James-Stein estimator,

(25)    $\hat\mu_{\mathrm{JS}}^b(x) = b + \left(1 - \frac{p-2}{\|x - b\|^2}\right)(x - b)$,



remains minimax, but now shrinks $x$ toward $b \in R^p$ where its risk function is smallest. Similarly, minimax Bayes shrinkage estimators of a mean or of a predictive density can be shifted to shrink toward $b$, by recentering the prior $\pi(\mu)$ to $\pi^b(\mu) = \pi(\mu - b)$. These shifted estimators are easily obtained by inserting the corresponding translated marginal

(26)    $m_\pi^b(z; v) = m_\pi(z - b; v)$

into (7) to obtain

(27)    $\hat\mu_\pi^b(x) = E_\pi^b(\mu \mid x) = x + \nabla \log m_\pi^b(x; 1)$,

and into (18) to obtain

(28)    $\hat p_\pi^b(y \mid x) = \frac{m_\pi^b(w; v_w)}{m_\pi^b(x; v_x)}\, \hat p_U(y \mid x)$.

Fig. 2. The risk difference between $\hat p_U$ and $\hat p_H$ when $\mu = (c, \ldots, c)'$, $v_x = 1$, $v_y = 0.2$.



Recentered unimodal priors such as $\pi_H^b$ and $\pi_a^b$ yield estimators that now shrink $x$ and $\hat p_U(y \mid x)$ toward $b$ rather than toward 0. Since the superharmonic properties of $m_\pi$ are inherited by $m_\pi^b$, the minimaxity of such estimators will be preserved.

In his discussion of Stein (1962), Lindley (1962) noted that the James-Stein estimator could be modified to shrink toward $(\bar x, \ldots, \bar x)' \in R^p$ ($\bar x$ is the mean of the components of $x$), by replacing $b$ and $(p - 2)$ in (25) by $(\bar x, \ldots, \bar x)'$ and $(p - 3)$, respectively. The resulting estimator remains minimax as long as $p \ge 4$ and offers smallest risk when $\mu$ is close to the subspace of $\mu$ with identical coordinates, the subspace spanned by the vector $1_p = (1, \ldots, 1)'$. Note that $(\bar x, \ldots, \bar x)'$ is the projection of $x$ into this subspace.
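The two recentered versions of the James-Stein estimator just described can be written down in a few lines. The sketch below is illustrative: it assumes identity covariance, the function names are hypothetical, and the example vector is arbitrary.

```python
import numpy as np

def james_stein_toward(x, b):
    """Shifted James-Stein estimate: shrink x toward the fixed point b, using (p - 2)."""
    p = x.size
    diff = x - b
    return b + (1.0 - (p - 2) / np.dot(diff, diff)) * diff

def lindley(x):
    """Lindley's variant: shrink toward (xbar, ..., xbar)', using (p - 3)."""
    p = x.size
    center = np.full(p, x.mean())   # projection of x onto the span of (1, ..., 1)'
    diff = x - center
    return center + (1.0 - (p - 3) / np.dot(diff, diff)) * diff

x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])   # illustrative observation, p = 5
print(james_stein_toward(x, np.zeros(5)))  # ordinary James-Stein when b = 0
print(lindley(x))                          # pulls each coordinate toward the mean 6
```

Note that Lindley's variant leaves the coordinate mean unchanged and shrinks only the deviations from it, which is why the constant drops from $(p - 2)$ to $(p - 3)$.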
More generally, minimax Bayes shrinkage estimators of a mean or of a predictive density can be similarly modified to obtain shrinkage toward any (possibly affine) subspace $B \subset R^p$, whenever they correspond to spherically symmetric priors. Such priors, which include $\pi_H$ and $\pi_a$, are functions of $\mu$ only through $\|\mu\|$. Such a modification is obtained by recentering the prior $\pi(\mu)$ around $B$ via

(29)    $\pi^B(\mu) = \pi(\mu - P_B \mu)$,

where $P_B \mu = \arg\min_{b \in B} \|\mu - b\|$ is the projection of $\mu$ onto $B$. Effectively, $\pi^B(\mu)$ puts a uniform prior on $P_B \mu$ and applies a suitably modified version of $\pi$ to $(\mu - P_B \mu)$. Note that the dimension of $(\mu - P_B \mu)$, namely $(p - \dim(B))$, must be taken into account when determining the appropriate modification for $\pi$. For example, recentering the harmonic prior $\pi_H(\mu) = \|\mu\|^{-(p-2)}$ around the subspace spanned by $1_p$ yields

(30)    $\pi_H^B(\mu) = \|\mu - \bar\mu 1_p\|^{-(p-3)}$,

where $\bar\mu = \mu' 1_p / p$. Here, the uniform prior is put on $P_B \mu = \bar\mu 1_p$, and the harmonic prior in dimension $(p - \dim(B)) = (p - 1)$ (which is different from the harmonic prior in $R^p$) is put on $(\mu - \bar\mu 1_p)$, the orthogonal complement of $B$.

The marginal $m_\pi^B$ corresponding to the recentered $\pi^B$ in (29) can be directly obtained by recentering the spherically symmetric marginal $m_\pi$ corresponding to $\pi$, that is,

(31)    $m_\pi^B(z; v) = m_\pi(z - P_B z; v)$,

where $P_B z$ is the projection of $z$ onto $B$. Analogously to $\pi^B(\mu)$, $m_\pi^B(z; v)$ is uniform on $P_B z$ and applies a suitably modified version of $m_\pi$ to $(z - P_B z)$. Here, too, the dimension of $(z - P_B z)$, namely $(p - \dim(B))$, must be taken into account when determining the appropriate modification for $m_\pi$. For example, recentering the marginal $m_H$ around the subspace spanned by $1_p$ would entail replacing $\|z\|$ by $\|z - \bar z 1_p\|$, where $\bar z = z' 1_p / p$, and appropriately modifying $m_H$ to apply to $R^{p-1}$.

Applying the recentering (29) to priors such as $\pi_H$ and $\pi_a$, which are unimodal around 0, yields priors $\pi_H^B$ and $\pi_a^B$ and hence marginals $m_H^B$ and $m_a^B$, which are unimodal around $B$. Such recentered marginals yield mean estimators

(32)    $\hat\mu_\pi^B(x) = E_\pi^B(\mu \mid x) = x + \nabla \log m_\pi^B(x; 1)$,

and predictive density estimators

(33)    $\hat p_\pi^B(y \mid x) = \frac{m_\pi^B(w; v_w)}{m_\pi^B(x; v_x)}\, \hat p_U(y \mid x)$,

that now shrink $x$ and $\hat p_U(y \mid x)$ toward $B$ rather than toward 0. Shrinkage will be largest when $x \in B$, and will diminish as $x$ moves away from $B$. These estimators offer smallest risk when $\mu \in B$, but do not improve in any important way over $x$ and $\hat p_U(y \mid x)$ when $\mu$ is far from $B$.

A superharmonic $m_\pi$ will lead to a superharmonic $m_\pi^B$ as long as $(p - \dim(B))$ is large enough. For example, the recentered marginal $m_H^B$ will be superharmonic only when $(p - \dim(B)) \ge 3$. In such cases, the minimaxity of both $\hat\mu_\pi^B$ and $\hat p_\pi^B$ will be preserved.

6. WHERE TO SHRINK? 

Stein's discovery of the existence of minimax shrinkage estimators such as $\hat\mu_{\mathrm{JS}}^b(x)$ in (25) demonstrated that costless improvements over the minimax $\hat\mu_{\mathrm{MLE}}$ were available near any target preselected by the statistician. As Stein (1962) put it when referring to the use of such an estimator to center a confidence region, the target "should be chosen... as one's best guess" of $\mu$. That frequentist considerations had demonstrated the folly of ignoring subjective input was quite a shock to the perceived "objectivity" of the frequentist perspective.

Although the advent of minimax shrinkage estimators of the form $\hat\mu_\pi^B$ in (32) and $\hat p_\pi^B$ in (33) opened up the possibility of small risk near any preselected (affine) subspace $B \subset R^p$ (this includes the possibility that $B$ is a single point), it also opened up a challenging new problem: how best to choose such a $B$. From the vast number of possible choices, the goal was to choose $B$ close to the unknown $\mu$; otherwise risk reduction would be negligible. To add to the difficulties, low-dimensional $B$, which offered the greatest risk reduction, were also the most difficult to place close to $\mu$.

When faced with a number of potentially good target choices, say $B_1, \ldots, B_N$, rather than choose one of them and proceed with $\hat\mu_\pi^B$ or $\hat p_\pi^B$, an attractive alternative is to use a minimax multiple shrinkage estimator; see George (1986a, 1986b, 1986c). Such estimators incorporate all the potential targets by combining them into an adaptive convex combination of $\hat\mu_\pi^{B_1}, \ldots, \hat\mu_\pi^{B_N}$ for mean estimation, and of $\hat p_\pi^{B_1}, \ldots, \hat p_\pi^{B_N}$ for predictive density estimation. By adaptively shrinking toward the more promising targets, the region of potential risk reduction is vastly enlarged while at the same time retaining the safety of minimaxity.

The construction of these minimax multiple shrinkage estimators proceeds as follows, again making fundamental use of the Bayesian formulation. For a spherically symmetric prior $\pi(\mu)$, a set of subspaces $B_1, \ldots, B_N$ of $R^p$, and a set of nonnegative weights $w_1, \ldots, w_N$ such that $\sum_{i=1}^N w_i = 1$, consider the mixture prior

(34)    $\pi_*(\mu) = \sum_{i=1}^N w_i\, \pi^{B_i}(\mu)$,

where each $\pi^{B_i}$ is a recentered prior as in (29). To simplify notation, we consider the case where each $\pi^{B_i}$ is a recentering of the same $\pi$, although in principle such a construction could be applied with different priors. The marginal $m_*$ corresponding to the mixture prior $\pi_*$ in (34) is then simply

(35)    $m_*(z; v) = \sum_{i=1}^N w_i\, m_\pi^{B_i}(z; v)$,

where $m_\pi^{B_i}$ are the recentered marginals corresponding to the $\pi^{B_i}$ as given by (31).

Applying Brown's representation $\hat\mu_\pi = x + \nabla \log m_\pi(x; 1)$ from (7) with $m_*$ in (35) immediately yields the multiple shrinkage estimator of $\mu$,

(36)    $\hat\mu_*(x) = \sum_{i=1}^N p(B_i \mid x)\, \hat\mu_\pi^{B_i}(x)$,

where

(37)    $p(B_i \mid x) = \frac{w_i\, m_\pi^{B_i}(x; 1)}{\sum_{j=1}^N w_j\, m_\pi^{B_j}(x; 1)}$.

Similarly, applying the representation $\hat p_\pi(y \mid x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)}\, \hat p_U(y \mid x)$ from (18) with $m_*$ immediately yields the multiple shrinkage estimator of $p(y \mid \mu)$,

(38)    $\hat p_*(y \mid x) = \sum_{i=1}^N p(B_i \mid x)\, \hat p_\pi^{B_i}(y \mid x)$,

where

(39)    $p(B_i \mid x) = \frac{w_i\, m_\pi^{B_i}(x; v_x)}{\sum_{j=1}^N w_j\, m_\pi^{B_j}(x; v_x)}$.

The forms (36) and (38) reveal $\hat\mu_*$ and $\hat p_*$ to be adaptive convex combinations of the individual posterior mean estimators $\hat\mu_\pi^{B_i}$ and $\hat p_\pi^{B_i}$, respectively. The adaptive weights $p(B_i \mid x)$ in (37) and (39) are the posterior probabilities that $\mu$ is contained in each of the $B_i$, effectively putting increased weight on those individual estimators which are shrinking most. Note that the uniform prior estimates $\hat\mu_U$ and $\hat p_U$ are here doubly shrunk by $\hat\mu_*$ and $\hat p_*(y \mid x)$; in addition to the individual estimator shrinkage they are further shrunk by the posterior probability $p(B_i \mid x)$.
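The convex-combination structure of a multiple shrinkage mean estimator can be sketched in code. To keep the marginals explicit, the example below substitutes normal priors $N_p(b_i, \nu I)$ recentered at point targets $b_i$ for the superharmonic priors of the text; this substitution is an assumption made purely for illustration, so no minimaxity is claimed for the resulting estimator. All names and numerical values are hypothetical.

```python
import numpy as np

def log_recentered_marginal(x, b, nu):
    """log marginal of X when mu ~ N_p(b, nu I) and X | mu ~ N_p(mu, I): X ~ N_p(b, (1 + nu) I)."""
    p = x.size
    s = 1.0 + nu
    return -0.5 * p * np.log(2 * np.pi * s) - 0.5 * np.sum((x - b) ** 2) / s

def posterior_weights(x, targets, weights, nu):
    """Adaptive weights: w_i m_i(x) normalized to sum to one."""
    log_terms = np.array([np.log(w) + log_recentered_marginal(x, b, nu)
                          for b, w in zip(targets, weights)])
    post = np.exp(log_terms - log_terms.max())   # subtract the max for numerical stability
    return post / post.sum()

def multiple_shrinkage_mean(x, targets, weights, nu):
    """Posterior-weighted combination of the single-target posterior means."""
    post = posterior_weights(x, targets, weights, nu)
    singles = [b + (nu / (1.0 + nu)) * (x - b) for b in targets]
    return sum(q * est for q, est in zip(post, singles))

targets = [np.full(4, 2.0), np.full(4, -2.0)]   # two illustrative point targets
weights = [0.5, 0.5]
x = np.array([1.5, 2.5, 1.0, 2.2])
print(posterior_weights(x, targets, weights, nu=1.0))        # nearly all weight on the first target
print(multiple_shrinkage_mean(x, targets, weights, nu=1.0))  # close to the first single-target estimate
```

Because the observation lies near the first target, the adaptive weights concentrate there and the combined estimate behaves almost like the corresponding single-target estimator.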

The key to obtaining $\hat\mu_*$ and $\hat p_*(y \mid x)$ which are minimax is simply to use priors which yield superharmonic $m_\pi^{B_1}, \ldots, m_\pi^{B_N}$. If such is the case, then trivially from (35)

(40)    $\nabla^2 m_* = \sum_{i=1}^N w_i\, \nabla^2 m_\pi^{B_i} \le 0$,

so that $m_*$ will be superharmonic, and the minimaxity of $\hat\mu_*$ and $\hat p_*(y \mid x)$ will follow immediately. Note that marginals whose square root is superharmonic will not be adequate, as this argument will fail.

The adaptive shrinkage behavior of $\hat\mu_*$ and $\hat p_*$ manifests itself as substantial risk reduction whenever $\mu$ is near any of $B_1, \ldots, B_N$. Let us illustrate how that happens for the predictive density estimator $\hat p_{H*}$, the multiple shrinkage version of $\hat p_H$. Figure 3 illustrates the risk reduction $[R_{\mathrm{KL}}(\mu, \hat p_U) - R_{\mathrm{KL}}(\mu, \hat p_{H*})]$ at various $\mu = (c, \ldots, c)'$ obtained by $\hat p_{H*}$ which adaptively shrinks $\hat p_U(y \mid x)$ toward the closer of the two points $b_1 = (2, \ldots, 2)'$ and $b_2 = (-2, \ldots, -2)'$ using equal weights $w_1 = w_2 = 0.5$. As in Figure 2, we considered the case $v_x = 1$, $v_y = 0.2$ for $p = 3, 5, 7, 9$. As the plot shows, maximum risk reduction occurs when $\mu$ is close to $b_1$ or $b_2$, and goes to 0 as $\mu$ moves away from either of these points. At the same time, for each fixed $\|\mu\|$, risk reduction by $\hat p_{H*}$ is larger for larger $p$. It is impressive that the size of the risk reduction offered by $\hat p_{H*}$ is nearly the same as each of its single target counterparts. The cost of multiple shrinkage enhancement seems negligible, especially compared to the benefits.

Fig. 3. The risk difference between $\hat p_U$ and multiple shrinkage $\hat p_{H*}$ when $\mu = (c, \ldots, c)'$, $v_x = 1$, $v_y = 0.2$, $b_1 = (2, \ldots, 2)'$, $b_2 = (-2, \ldots, -2)'$, and $w_1 = w_2 = 0.5$.

7. EMPIRICAL BAYES CONSTRUCTIONS 

Beyond their attractive risk properties, the James-Stein estimator $\hat\mu_{\mathrm{JS}}$ and its positive-part counterpart $\hat\mu_{\mathrm{JS}+}$ are especially appealing because of their simple closed forms which are easy to compute. As shown by Xu and Zhou (2011), similarly appealing simple closed-form predictive density shrinkage estimators can be obtained by the same empirical Bayes considerations that motivate $\hat\mu_{\mathrm{JS}}$ and $\hat\mu_{\mathrm{JS}+}$.

The empirical Bayes motivation of $\hat\mu_{\mathrm{JS}}$, alluded to in Section 1, simply entails replacing $1/(1 + \nu)$ in (4) by $(p - 2)/\|x\|^2$, its unbiased estimate under the marginal distribution of $X$ when $X \mid \mu \sim N_p(\mu, I)$ and $\mu \sim N_p(0, \nu I)$. The positive-part $\hat\mu_{\mathrm{JS}+}$ is obtained by using the truncated estimate $(p - 2)/\max\{p - 2, \|x\|^2\}$, which never exceeds 1 and so avoids an implicitly negative estimate of the prior variance $\nu$.
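The unbiasedness claim behind this empirical Bayes motivation is easy to probe by simulation: marginally $X \sim N_p(0, (1 + \nu) I)$, so $(p - 2)/\|X\|^2$ should average to $1/(1 + \nu)$. The following is a minimal sketch; the dimension, prior variance, seed and number of draws are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, nu, n_draws = 10, 3.0, 200000   # illustrative settings

# Marginally, X ~ N_p(0, (1 + nu) I) when X | mu ~ N_p(mu, I) and mu ~ N_p(0, nu I).
x = np.sqrt(1.0 + nu) * rng.standard_normal((n_draws, p))
estimate = (p - 2) / np.sum(x**2, axis=1)

print(estimate.mean(), 1.0 / (1.0 + nu))   # Monte Carlo average vs target 1 / (1 + nu)
```

Individual values of the estimate can exceed 1, which is exactly the situation the positive-part truncation is designed to handle.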

Proceeding analogously, Xu and Zhou considered 
the Bayesian predictive density estimate, 



$$\hat p_v(y|x) \sim N_p\left(\left(1 - \frac{v_x}{v_x+v}\right)x,\ \left[\frac{v_x}{v_x+v}\,v_y + \left(1 - \frac{v_x}{v_x+v}\right)(v_x+v_y)\right] I\right), \tag{41}$$



when $X|\mu \sim N_p(\mu, v_x I)$ and $Y|\mu \sim N_p(\mu, v_y I)$ are independent, and $\mu \sim N_p(0, vI)$. Replacing $v_x/(v_x+v)$ by its truncated unbiased estimate $(p-2)v_x/\max\{(p-2)v_x, \|x\|^2\}$ under the marginal distribution of $X$, they obtained the empirical Bayes predictive density estimate



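$$\hat p_{p-2}(y|x) \sim N_p\left(\left(1 - \frac{(p-2)v_x}{\|x\|^2}\right)_+ x,\ \left[v_y + \left(1 - \frac{(p-2)v_x}{\|x\|^2}\right)_+ v_x\right] I\right), \tag{42}$$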



where $(\cdot)_+ = \max\{0, \cdot\}$, an appealing simple closed form. Centered at $\hat\mu_{JS+}$, $\hat p_{p-2}$ converges to the best invariant procedure $\hat p_U \sim N_p(x, (v_x + v_y)I)$ as $\|x\|^2 \to \infty$, and converges to $N_p(0, v_y I)$ as $\|x\|^2 \to 0$. Thus, $\hat p_{p-2}$ can be viewed as a shrinkage predictive density estimator that "pulls" $\hat p_U$ toward 0, its shrinkage adaptively determined by the data.
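As a rough illustration of how simple the closed form is to compute, the sketch below evaluates the mean, the common variance, and the log density of a spherical normal predictive density with shrinkage factor $(1 - k v_x/\|x\|^2)_+$. The function names and the optional constant `k` (defaulting to $p-2$) are assumptions made for this example; the code is not taken from Xu and Zhou (2011).

```python
import math

def eb_predictive_params(x, v_x, v_y, k=None):
    """Mean vector and common variance of the empirical Bayes predictive normal.

    Illustrative helper: shrink = (1 - k * v_x / ||x||^2)_+ with k = p - 2 by default.
    """
    p = len(x)
    if k is None:
        k = p - 2
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm > 0.0:
        shrink = max(0.0, 1.0 - k * v_x / sq_norm)
    else:
        shrink = 0.0  # limiting value as ||x||^2 -> 0, assuming k > 0
    mean = [shrink * xi for xi in x]
    var = v_y + shrink * v_x
    return mean, var

def eb_predictive_logpdf(y, x, v_x, v_y, k=None):
    """Log density at y of the N_p(mean, var * I) predictive estimate."""
    mean, var = eb_predictive_params(x, v_x, v_y, k)
    p = len(x)
    sq_dist = sum((yi - mi) ** 2 for yi, mi in zip(y, mean))
    return -0.5 * p * math.log(2.0 * math.pi * var) - 0.5 * sq_dist / var
```

With $v_x = 1$ and $v_y = 0.2$, the variance moves from $v_y = 0.2$ when $x$ is near the origin toward $v_x + v_y = 1.2$ as $\|x\|^2$ grows, matching the two limits described above.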

To assess the KL risk properties of such empirical Bayes estimators, Xu and Zhou considered the class of estimators $\hat p_k$ of the form (42) with $(p-2)$ replaced by a constant $k$, a class of simple normal forms centered at shrinkage estimators of $\mu$ with data-dependent variances to incorporate estimation uncertainty. For this class, they provided general sufficient conditions on $k$ and the dimension $p$ for $\hat p_k$ to dominate the best invariant predictive density $\hat p_U$ and thus be minimax. Going further, they also established an "oracle" inequality which suggests that the empirical Bayes predictive density estimator is asymptotically minimax in infinite-dimensional parameter spaces and can potentially be used to construct adaptive minimax estimators. It appears that these minimax empirical Bayes predictive densities may play the same role as the James-Stein estimator in such problems.

It may be of interest to note that a particular pseudo-marginal empirical Bayes construction that works fine for the mean estimation problem appears not to work for the predictive density estimation problem. For instance, the positive-part James-Stein estimator $\hat\mu_{JS+}$ can be expressed as $\hat\mu_{JS+} = x + \nabla \log m_{JS+}(x; 1)$, where $m_{JS+}(x; v)$ is the function

$$m_{JS+}(x;v) = \begin{cases} k_p \|x\|^{-(p-2)} & \text{if } \|x\|^2/v \ge (p-2), \\ v^{-(p-2)/2} \exp\{-\|x\|^2/2v\} & \text{if } \|x\|^2/v < (p-2), \end{cases}$$

with $k_p = (e/(p-2))^{-(p-2)/2}$ (see Stein, 1974). We refer to $m_{JS+}(z; v)$ as a pseudo-marginal because it is not a bona fide marginal obtained by a real prior. Nonetheless, it plays the formal role of a marginal in the mean estimation problem, and can be used to generate further innovations such as minimax multiple shrinkage James-Stein estimators (see George, 1986a, 1986b, 1986c).
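The representation $\hat\mu_{JS+} = x + \nabla \log m_{JS+}(x;1)$ can be checked branch by branch. The sketch below assumes the piecewise form of $m_{JS+}$ given above with $v = 1$ and writes the positive-part estimator as $(1 - (p-2)/\|x\|^2)_+ x$; it is meant as a verification aid rather than as Stein's (1974) original argument.

```latex
% Outer branch, \|x\|^2 \ge p-2:
\log m_{JS+}(x;1) = \log k_p - (p-2)\log\|x\|,
\qquad
\nabla \log m_{JS+}(x;1) = -\frac{p-2}{\|x\|^2}\,x,
\qquad
x + \nabla \log m_{JS+}(x;1) = \Bigl(1 - \frac{p-2}{\|x\|^2}\Bigr)x .

% Inner branch, \|x\|^2 < p-2:
\log m_{JS+}(x;1) = -\tfrac12\|x\|^2,
\qquad
x + \nabla \log m_{JS+}(x;1) = x - x = 0 .

% Continuity at \|x\|^2 = p-2, which pins down k_p:
k_p\,(p-2)^{-(p-2)/2} = e^{-(p-2)/2}
\iff
k_p = \bigl(e/(p-2)\bigr)^{-(p-2)/2}.
```

Since the factor $1 - (p-2)/\|x\|^2$ is nonnegative exactly when $\|x\|^2 \ge p-2$, the two branches together reproduce the positive-part form.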

Proceeding by analogy, it would seem that $m_{JS+}(z; v)$ could be inserted into the representation (18) from Lemma 1 to obtain similar results under KL loss. Unfortunately, this does not yield a suitable minimax predictive estimator because the resulting $\hat p_{JS+}(y|x)$ is not a proper probability distribution. Indeed, $\int \hat p_{JS+}(y|x)\,dy \ne 1$ and varies with $x$. What has gone wrong? Because they do not correspond to real priors, such pseudo-marginals are ultimately at odds with the probabilistic coherence of a valid Bayesian approach. In contrast to the mean estimation framework, the predictive density estimation framework apparently requires stronger fidelity to the Bayesian paradigm.

8. PREDICTIVE DENSITY ESTIMATION FOR 
CLASSICAL REGRESSION 

Moving into the multiple regression setting, Stein (1960) considered the estimation of a $p$-dimensional coefficient vector under suitably rescaled quadratic loss. He there established the minimaxity of the maximum likelihood estimator, and then proved its inadmissibility when $p \ge 3$, by demonstrating the existence of a dominating shrinkage estimator.

In a similar vein, as one might expect, the theory 
of predictive density estimation presented in Sec- 
tions 2 and 3 can also be extended to the multiple re- 
gression framework. We here describe the main ideas 
of the development of this extension which appeared 
in George and Xu (2008). Similar results, developed 
independently from a slightly different perspective, 
appeared at the same time in Kobayashi and Ko- 
maki (2008). 

Consider the canonical normal linear regression 
setup: 

$$X|\beta \sim N_m(A\beta, \sigma^2 I), \qquad Y|\beta \sim N_n(B\beta, \sigma^2 I), \tag{43}$$

where $A$ is a full rank, fixed $m \times p$ matrix, $B$ is a fixed $n \times p$ matrix, and $\beta$ is a common $p \times 1$ unknown regression coefficient vector. The error variance $\sigma^2$ is assumed to be known, and set to be 1 without loss of generality. The problem is to find an estimator $\hat p(y|x)$ of the predictive density $p(y|\beta)$, evaluating its performance by KL risk

$$R_{KL}(\beta, \hat p) = \int p(x|\beta)\, L(\beta, \hat p(\cdot|x))\, dx, \tag{44}$$

where $L(\beta, \hat p(\cdot|x))$ is the KL loss between the density $p(y|\beta)$ and its estimator $\hat p(y|x)$.

The story begins with the result, analogous to Aitchison's (1975) for the normal mean problem, that the plug-in estimator $p(y|\hat\beta_x)$, where $\hat\beta_x$ is the least squares estimate of $\beta$ based on $x$, is dominated under KL risk by the posterior mean of $p(y|\beta)$, the Bayes rule under the uniform prior

$$\hat p_U(y|x) = \frac{1}{(2\pi)^{n/2}}\, \frac{|A'A + B'B|^{-1/2}}{|A'A|^{-1/2}}\, \exp\left\{-\frac{\mathrm{RSS}_{x,y} - \mathrm{RSS}_x}{2}\right\}. \tag{45}$$
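One way to see where (45) comes from is to write the uniform-prior Bayes rule as a ratio of marginals. The sketch below takes $\sigma^2 = 1$ and uses notation that is assumed here rather than quoted: $m_U$ for the marginals under the uniform prior, $\mathrm{RSS}_x = \|x - A\hat\beta_x\|^2$, and $\mathrm{RSS}_{x,y}$ for the residual sum of squares from the least squares fit to $x$ and $y$ jointly.

```latex
% Marginal of x under the uniform prior \pi(\beta) \equiv 1 (Gaussian integral over \beta):
m_U(x) = \int (2\pi)^{-m/2} e^{-\|x - A\beta\|^2/2}\,d\beta
       = (2\pi)^{-(m-p)/2}\,|A'A|^{-1/2}\,e^{-\mathrm{RSS}_x/2}.

% Same computation for the stacked observation (x, y), whose design has Gram matrix A'A + B'B:
m_U(x, y) = (2\pi)^{-(m+n-p)/2}\,|A'A + B'B|^{-1/2}\,e^{-\mathrm{RSS}_{x,y}/2}.

% Bayes rule under the uniform prior as the ratio of the two marginals:
\hat p_U(y|x) = \frac{m_U(x,y)}{m_U(x)}
 = \frac{1}{(2\pi)^{n/2}}\,\frac{|A'A + B'B|^{-1/2}}{|A'A|^{-1/2}}\,
   \exp\Bigl\{-\tfrac12\bigl(\mathrm{RSS}_{x,y} - \mathrm{RSS}_x\bigr)\Bigr\}.
```

The full rank assumption on $A$ is what makes both Gaussian integrals finite.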



Here, too, $\hat p_U$ is minimax (Liang, 2002; Liang and Barron, 2004) and plays the straw-man role of the estimator to beat. The challenge was to determine which priors $\pi$ would lead to Bayes rules which dominated $\hat p_U$, and hence would be minimax. Analogously to the representation (18) in Lemma 1 for the normal mean problem, the following representation for a Bayes rule $\hat p_\pi(y|x)$ here was the key to meeting this challenge.

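Lemma 3. The Bayes rule $\hat p_\pi(y|x) = \int p(y|\beta)\,\pi(\beta|x)\,d\beta$ can be expressed as

$$\hat p_\pi(y|x) = \frac{m_\pi(\hat\beta_{x,y}; \Sigma_C)}{m_\pi(\hat\beta_x; \Sigma_A)}\, \hat p_U(y|x), \tag{46}$$

where $\Sigma_A = (A'A)^{-1}$, $C = (A', B')'$ so that $C'C = A'A + B'B$, $\Sigma_C = (C'C)^{-1}$, $\hat\beta_x$ is the least squares estimate of $\beta$ based on $x$, $\hat\beta_{x,y}$ is the least squares estimate based on $x$ and $y$, and $m_\pi(z; \Sigma)$ is the marginal distribution of $Z|\beta \sim N_p(\beta, \Sigma)$ under $\pi(\beta)$.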

The representation (46) leads immediately to the following analogue of (19) for the KL risk difference between $\hat p_U(y|x)$ and $\hat p_\pi(y|x)$:

$$R_{KL}(\beta, \hat p_U) - R_{KL}(\beta, \hat p_\pi) = E_{\beta,\Sigma_C} \log m_\pi(\hat\beta_{x,y}; \Sigma_C) - E_{\beta,\Sigma_A} \log m_\pi(\hat\beta_x; \Sigma_A). \tag{47}$$

The challenge thus became that of finding conditions on $m_\pi$ to make this difference positive, a challenge made more difficult than the previous one for (19) because of the complexity of $\Sigma_A$ and $\Sigma_C$. Fortunately, this could be resolved by rotating the problem as follows to obtain diagonal forms. Since $\Sigma_A$ and $\Sigma_C$ are both symmetric and positive definite, there exists a full rank $p \times p$ matrix $W$ such that

$$\Sigma_A = WW', \qquad \Sigma_C = WDW', \qquad D = \mathrm{diag}(d_1, \ldots, d_p). \tag{48}$$
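A matrix $W$ satisfying (48) can be written down explicitly. The construction below is one standard choice (simultaneous diagonalization via a symmetric square root), offered as a sketch; the symbols $M$, $U$ and $\lambda_i$ are introduced here for illustration and are not part of the original notation.

```latex
% Symmetric positive definite square root of \Sigma_A and a spectral decomposition:
M = \Sigma_A^{-1/2}\,\Sigma_C\,\Sigma_A^{-1/2} = U D U',
\qquad U'U = I, \qquad D = \mathrm{diag}(d_1,\ldots,d_p).

% Define W and verify (48):
W = \Sigma_A^{1/2} U,
\qquad
WW' = \Sigma_A^{1/2} U U' \Sigma_A^{1/2} = \Sigma_A,
\qquad
WDW' = \Sigma_A^{1/2} M\, \Sigma_A^{1/2} = \Sigma_C .

% With \Sigma_C = (\Sigma_A^{-1} + B'B)^{-1}, the matrix M simplifies to
M = \bigl(I + \Sigma_A^{1/2} B'B\, \Sigma_A^{1/2}\bigr)^{-1},
\qquad
d_i = \frac{1}{1+\lambda_i},
% where \lambda_i \ge 0 are the eigenvalues of \Sigma_A^{1/2} B'B\, \Sigma_A^{1/2}.
```

Because each $\lambda_i \ge 0$, every $d_i$ lies in $(0, 1]$, and $d_i < 1$ whenever the corresponding $\lambda_i$ is strictly positive.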

Because $\Sigma_C = (\Sigma_A^{-1} + B'B)^{-1}$, where $B'B$ is nonnegative definite, it follows that $d_i \in (0, 1]$ for all $1 \le i \le p$ with at least one $d_i < 1$. Thus, the parameters for the rotated problem become

$$\mu = W^{-1}\beta, \qquad \hat\mu_x = W^{-1}\hat\beta_x \sim N_p(\mu, I), \qquad \hat\mu_{x,y} = W^{-1}\hat\beta_{x,y} \sim N_p(\mu, D). \tag{49}$$

Letting $V_w = wI + (1-w)D$ for $w \in [0, 1]$, the risk difference (47) could be reexpressed as

$$R_{KL}(\beta, \hat p_U) - R_{KL}(\beta, \hat p_\pi) = E_{\mu,D} \log m_{\pi_W}(\hat\mu_{x,y}; D) - E_{\mu,I} \log m_{\pi_W}(\hat\mu_x; I) = h_\mu(V_0) - h_\mu(V_1), \tag{50}$$

where $h_\mu(V_w) = E_{\mu,V_w} \log m_{\pi_W}(Z; V_w)$ and $\pi_W(\mu) = \pi(W\mu)$. The minimaxity of $\hat p_\pi$ would now follow from conditions on $m_\pi$ such that $(\partial/\partial w)\, h_\mu(V_w) \le 0$ for all $\mu$ and $w \in [0, 1]$. The following substantial generalizations of Theorem 1 and Corollary 1 provide exactly those conditions.

Theorem 2. Suppose $m_\pi(z; WW')$ is finite for all $z$ with the invertible matrix $W$ defined as in (48). Let $H(f(z_1, \ldots, z_p))$ be the Hessian matrix of $f$.

(i) If $\mathrm{trace}\{H(m_\pi(z; WV_wW'))[\Sigma_A - \Sigma_C]\} \le 0$ for all $w \in [0, 1]$, then $\hat p_\pi(y|x)$ is minimax.

(ii) If $\mathrm{trace}\{H(\sqrt{m_\pi(z; WV_wW')})[\Sigma_A - \Sigma_C]\} \le 0$ for all $w \in [0, 1]$, then $\hat p_\pi(y|x)$ is minimax.

Corollary 2. Suppose $m_\pi(z; WW')$ is finite for all $z$. Then $\hat p_\pi(y|x)$ is minimax if

$$\mathrm{trace}\{H(\pi(\beta))[\Sigma_A - \Sigma_C]\} \le 0 \quad \text{a.e.}$$

As a consequence of Corollary 2, the scaled harmonic prior $\pi_H(\beta|W) \propto \|W^{-1}\beta\|^{-(p-2)}$ can be shown to yield minimax predictive density estimators for the regression setting.

Going further, George and Xu (2008) went on to 
show that the minimax Bayes estimators here can 
be modified to shrink toward different points and 
subspaces as in Section 5, and that the minimax 
multiple shrinkage constructions of Section 6 apply 
as well. In particular, they obtained minimax mul- 
tiple shrinkage estimators that naturally accommo- 
date variable selection uncertainty. 

9. PREDICTIVE DENSITY ESTIMATION FOR 
NONPARAMETRIC REGRESSION 

Moving in another direction, Xu and Liang (2010) 
considered predictive density estimation in the con- 
text of modern nonparametric regression, a context 
in which the James-Stein estimator has turned out 
to play an important asymptotic minimaxity role; 
see Wasserman (2006). Their results pertain to the 
canonical setup for nonparametric regression: 

$$Y(t_i) = f(t_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{51}$$

where $f$ is an unknown smooth function in $\mathcal{L}_2[0, 1]$, $t_i = i/n$, and the $\epsilon_i$'s are i.i.d. $N(0, 1)$. A central problem here is to estimate $f$ or various functionals of $f$ based on observing $Y = (Y(t_1), \ldots, Y(t_n))$. Transforming the problem with an orthonormal basis, (51) is equivalent to estimating the $\theta_i$'s in

$$y_i = \theta_i + \epsilon_i, \qquad \epsilon_i \sim N\left(0, \frac{1}{n}\right), \qquad i = 1, \ldots, n, \tag{52}$$

known as the Gaussian sequence model. The model above is different from the ordinary multivariate normal model in two aspects: (1) the model dimension $n$ is increasing with the sample size, and (2) under function space assumptions on $f$, the $\theta_i$'s lie in a constrained space, for example, an ellipsoid $\{\theta : \sum_i a_i^2 \theta_i^2 \le C,\ a_i \to \infty\}$.

A large body of literature has been devoted to minimax estimation of $f$ under $\ell_2$ risk over certain function spaces; see, for example, Johnstone (2003), Efromovich (1999), and the references therein. As opposed to the ordinary multivariate normal mean problem, exact minimax analysis is difficult for the Gaussian sequence model (52) when a constraint on the parameters is considered. This difficulty has been overcome by first obtaining the minimax risk of a subclass of estimators of a simple form, and then showing that the overall minimax risk is asymptotically equivalent to the minimax risk of the subclass. For example, an important result from Pinsker (1980) is that when the parameter space is constrained to an ellipsoid, the nonlinear minimax risk is asymptotically equivalent to the linear minimax risk, namely the minimax risk of the subclass of linear estimators of the form $\hat\theta_i = c_i y_i$.

For nonparametric regression, the following analogy between estimation under $\ell_2$ risk and predictive density estimation under KL risk was established in Xu and Liang (2010). The prediction problem for nonparametric regression is formulated as follows. Let $\tilde Y = (\tilde Y(u_1), \ldots, \tilde Y(u_m))$ be future observations arising at a set of dense ($m \ge n$) and equally spaced locations $\{u_j\}_{j=1}^m$. Given $f$, the predictive density $p(\tilde y|f)$ is just a product of Gaussians. The problem is to find an estimator $\hat p(\tilde y|y)$ of $p(\tilde y|f)$, where performance is measured by the averaged KL risk

$$R(f, \hat p) = \frac{1}{m}\, E_{Y,\tilde Y|f} \log \frac{p(\tilde Y|f)}{\hat p(\tilde Y|Y)}. \tag{53}$$

In this formulation, densities are estimated at the $m$ locations simultaneously by $\hat p(\tilde y|y)$. As it turned out, the KL risk based on the simultaneous formulation (53) is the analog of the $\ell_2$ risk for estimation. Indeed, under the KL risk (53), the prediction problem for a nonparametric regression model can be converted to the one for a Gaussian sequence model.

Based on this formulation of the problem, minimax analysis proceeds as in the general framework for the minimax study of function estimation used by, for example, Pinsker (1980) and Belitser and Levit (1995, 1996). The linear estimators there, which play a central role in their minimax analysis, take the same form as posterior means under normal priors. Analogously, predictive density estimates under the same normal priors turned out to play the corresponding role in the minimax analysis for prediction. (The same family of Bayes rules arises from the empirical Bayes approach in Section 7.) Thus, Xu and Liang (2010) were ultimately able to show that the overall minimax KL risk is asymptotically equivalent to the minimax KL risk of this subclass of Bayes rules, a direct analogue of Pinsker's theorem for predictive density estimation in nonparametric regression.
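The link between linear estimators and normal priors noted above can be made concrete with a small sketch. It assumes independent priors $\theta_i \sim N(0, \tau_i^2)$ in the sequence model with noise variance $1/n$, and, as a further assumption of this example only, a future observation with noise variance $1/m$ in each coordinate. The function name `normal_prior_rules` and the argument `tau2` are hypothetical.

```python
def normal_prior_rules(y, tau2, n, m):
    """Coordinatewise Bayes rules under theta_i ~ N(0, tau2[i]).

    Observed: y_i = theta_i + eps_i, eps_i ~ N(0, 1/n).
    Assumed for illustration: future noise variance 1/m per coordinate.
    Returns a list of (c_i, predictive_mean_i, predictive_variance_i).
    """
    noise = 1.0 / n
    future_noise = 1.0 / m  # assumption of this sketch, not a quoted formula
    rules = []
    for yi, t2 in zip(y, tau2):
        c = t2 / (t2 + noise)   # linear shrinkage weight in [0, 1)
        post_mean = c * yi      # posterior mean has the linear form c_i * y_i
        post_var = c * noise    # posterior variance of theta_i
        rules.append((c, post_mean, post_var + future_noise))
    return rules
```

The posterior mean is the linear estimator $c_i y_i$ with $c_i = \tau_i^2/(\tau_i^2 + 1/n)$, and the corresponding predictive density in each coordinate is normal, centered at the same point, with the posterior variance added to the future noise variance.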



10. DISCUSSION 



Stein's (1956) discovery of the existence of shrink- 
age estimators that uniformly dominate the mini- 
max maximum likelihood estimator of the mean of 
a multivariate normal distribution under quadratic 
risk when $p \ge 3$ was the beginning of a major research
effort to develop improved minimax shrinkage
estimation. In subsequent papers Stein guided this 
effort toward the Bayesian paradigm by providing 
explicit examples of minimax empirical Bayes and 
fully Bayes rules. Making use of the fundamental 
results of Brown (1971), he developed a general the- 
ory for establishing minimaxity based on the super- 
harmonic properties of the marginal distributions 
induced by the priors. 

The problem of predictive density estimation of 
a multivariate normal distribution under KL risk has 
more recently seen a series of remarkably parallel de- 
velopments. With a focus on Bayes rules catalyzed 
by Aitchison (1975), Komaki (2001) provided a fun- 
damental breakthrough by demonstrating that the 
harmonic prior Bayes rule dominated the best in- 
variant uniform prior Bayes rule. These results sug- 
gested the existence of a theory for minimax esti- 
mation based on the superharmonic properties of 
marginals, a theory that was then established in 
George, Liang and Xu (2006). Further developments of new minimax shrinkage predictive density estimators now abound, including, as described in this article, multiple shrinkage estimators, empirical Bayes estimators, normal linear model regression estimators, and nonparametric regression estimators. Examples of promising further new directions for predictive density estimation can be found in the work of Komaki (2004, 2006, 2009), which included results for Poisson distributions, for general location-scale models and for Wishart distributions, in the work of Ghosh, Mergel and Datta (2008), which developed estimation under alternative divergence losses, and in the work of Kato (2009), which established improved minimax predictive domination for the multivariate normal distribution under KL risk when both the mean and the variance are unknown. Minimax predictive density estimation is now beginning to flourish.